The original data is stored in the file
lxb-wz2c-m46p (1).xlsx. After excluding the first 100 rows,
a random sample of 100 entries was drawn from the remaining data with
seed 2025. Based on human judgment and the ground truth, a summary of
each text comment in the original data was extracted and recorded in
13 features (Document.ID, Tone, Tone_justify,
Tone_quote, Commenter_Role, Specialty,
Patient_cost, Patient_quality, Provider_pay,
Provider_quality, Other, Justification,
Quote), together with the additional columns
Role_category, word_count, Concern_Level, State/Province,
Provider_logistics, Provider_pay_decrease, and has_substantive_tone,
in a CSV file named fill_sample.csv. The file contains 18
discrete variables and 1 continuous variable with 100 rows (1,900 cells
in total). Of these cells, 16.2% are empty, meaning no useful
information could be extracted.
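The sampling step can be sketched as follows. This is only a Python illustration under stated assumptions: the function name `draw_sample` and the placeholder rows are hypothetical, and a Python generator seeded with 2025 will not pick the same rows as the seed used in the original analysis.

```python
import random

def draw_sample(rows, skip=100, size=100, seed=2025):
    """Drop the first `skip` rows, then draw `size` rows without replacement."""
    remaining = rows[skip:]
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return rng.sample(remaining, size)

# Placeholder rows standing in for the spreadsheet entries (hypothetical IDs).
rows = [f"row-{i}" for i in range(1, 501)]
sample = draw_sample(rows)
print(len(sample), len(set(sample)))
```

Using a dedicated `random.Random` instance keeps the draw repeatable regardless of other code that touches the global generator.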
Column-wise Missingness:
The Commenter_Role column has a missing rate of 39%.
The State/Province column has a missing rate of 38%.
The Specialty column, which reflects the sub-identity within the
commenter’s role, is missing in 77% of entries.
The Other column, which captures other extractable information
from comments, has a 77% missing rate.
Row-wise Missingness:
The 4 rows with the most missing values share one common trait: their
tone is classified as “very negative”.
The 4 rows with complete information (Rows 15, 37, 69, and 83) share one
common trait: their roles tend to be professional rather than
non-professional (e.g., patient), suggesting that professional identity
may correlate with more complete information.
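The column-wise and row-wise missingness figures above could be computed along these lines. This is a minimal sketch on toy rows, assuming that empty strings, `None`, and the literal `NA` all count as missing; the helper names are hypothetical.

```python
def is_missing(value):
    """Treat None, empty strings, and the literal 'NA' as missing (an assumption)."""
    return value is None or str(value).strip() in ("", "NA")

def column_missing_rate(rows, column):
    """Share of rows whose value in `column` is missing."""
    return sum(is_missing(r.get(column)) for r in rows) / len(rows)

def rows_by_missing_count(rows):
    """Return (row_number, missing_count) pairs, most incomplete rows first."""
    counts = [(i + 1, sum(is_missing(v) for v in r.values()))
              for i, r in enumerate(rows)]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)

# Toy rows, not taken from fill_sample.csv.
toy = [
    {"Tone": "very_negative", "Commenter_Role": "", "Specialty": ""},
    {"Tone": "negative", "Commenter_Role": "physician", "Specialty": "emergency"},
    {"Tone": "negative", "Commenter_Role": "patient", "Specialty": "NA"},
    {"Tone": "neutral", "Commenter_Role": None, "Specialty": None},
]
print(column_missing_rate(toy, "Specialty"))
print(rows_by_missing_count(toy))
```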
Description:
The tone of a comment is categorized as
very_negative, negative, neutral, or
positive.
The has_substantive_tone of a comment indicates whether
it gives a substantive explanation rather than only expressing
emotion.
The Tone_justify provides the reasoning behind the tone
classification based on linguistic metrics. These include
negative_word (presence of words with negative connotation,
such as not or destroy), strong_word (forceful
terms such as must or stop), imperative (use
of imperative sentences), capitalization (use of
capitalization for emphasis), punctuation (presence of
punctuation marks such as ! or ?),
implicature (presence of indirect expression or implied
negativity), fact-based (whether the comment is fact-based),
supportive (whether the comment expresses encouragement or
approval), and high_frequency (how often these
features occur within the comment, measured either as the proportion
of sentences containing such features or by noting repeated use of the
same indicator).
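Some of the surface-level metrics can be approximated mechanically. The sketch below is illustrative only: it is not the annotation procedure used for this sample, the word lists contain just the examples named above, and the all-caps rule (a fully capitalized word of two or more letters) is an assumed heuristic that would also flag acronyms.

```python
import re

NEGATIVE_WORDS = {"not", "destroy"}  # examples from the text; a real list would be longer
STRONG_WORDS = {"must", "stop"}      # examples from the text

def tone_flags(comment):
    """Flag four of the Tone_justify metrics using simple text rules."""
    words = re.findall(r"[A-Za-z']+", comment)
    lowered = {w.lower() for w in words}
    return {
        "negative_word": bool(lowered & NEGATIVE_WORDS),
        "strong_word": bool(lowered & STRONG_WORDS),
        # Assumed rule: any fully capitalized word of 2+ letters counts as emphasis.
        "capitalization": any(len(w) > 1 and w.isupper() for w in words),
        "punctuation": bool(re.search(r"[!?]", comment)),
    }

quote = "PLEASE DO NOT LET OUT OF CONTROL GREED DESTROY MEDICAL CARE IN THIS COUNTRY!!"
print(tone_flags(quote))
```

Metrics such as implicature, imperative, and fact-based depend on meaning rather than surface form and are not covered by this sketch.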
Tone Distribution:
The “Percentage of Each Category in Tone” pie chart shows that negative
and very negative tones have roughly equal proportions (44% and 47%,
respectively). In contrast, neutral and positive tones are much less
common (8% neutral, 1% positive). This indicates that the majority of
commenters in the sample express a negative attitude toward private
equity (PE).
Correlation between Tone and Tone_justify:
According to the chi-square test (\(p<0.05\)), Tone_justify is statistically
significantly associated with Tone, which suggests that the Tone_justify
metrics we defined are reasonable. The heatmap, which is based on the
scaled contingency table, shows the proportion of each Tone_justify
metric within each Tone. It can be read vertically and horizontally, in
combination with the pie charts.
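A scaled contingency table of this kind can be sketched as below. The direction of scaling (shares within each tone) is an assumption based on the description above, and the counts are toy values rather than the sample's.

```python
from collections import Counter, defaultdict

def scaled_contingency(pairs):
    """pairs: iterable of (tone, metric). Returns {tone: {metric: share within tone}}."""
    counts = Counter(pairs)
    tone_totals = Counter()
    for (tone, _metric), n in counts.items():
        tone_totals[tone] += n
    table = defaultdict(dict)
    for (tone, metric), n in counts.items():
        table[tone][metric] = n / tone_totals[tone]
    return dict(table)

# Toy (tone, metric) pairs, not taken from the data.
pairs = (
    [("negative", "negative_word")] * 3
    + [("negative", "imperative")]
    + [("very_negative", "punctuation")] * 2
    + [("very_negative", "capitalization")] * 2
)
shares = scaled_contingency(pairs)
print(shares["negative"])
```

Reading down one tone gives the vertical view; comparing one metric's counts across tones gives the horizontal view.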
Vertically:
In the negative tone category, the
negative_word metric is the most dominant, while
capitalization, implicature,
strong_word, objective, and
supportive appear least frequently.
In the very negative tone, punctuation and
capitalization contribute the most.
In neutral and positive tones, only
objective and supportive appear,
respectively, as the sole contributing metrics.
Horizontally:
In negative_word, negative tone contributes
most, followed by very negative.
In strong_word, high_frequency,
capitalization, and implicature,
very negative tone contributes most.
In imperative and punctuation, negative tone
contributes most.
Explore has_substantive_tone:
When summarizing the comments, we noticed that some of them did
not contain much substantive information. We therefore examined the
characteristics of the comments that lack substantive content,
including tone, commenter role, and concern level. Most of them have a
very_negative tone, most show the no_concern level, and none of them
indicate the commenter role. In conclusion, non-substantive comments
tend to contain less useful information.
Below are examples of Tone_quote:
FTC-2024-0022-0951: Very Negative
> “It is WRONG and destructive to quality of care.”, “PLEASE DO NOT LET OUT OF CONTROL GREED DESTROY MEDICAL CARE IN THIS COUNTRY!!”, “Medicine should focus on CARE, not SHAREHOLDER OR OWNER PROFIT MAKING!!!”

FTC-2024-0022-1656: Negative
> “Private Equity firms, large insurance companies have made working in healthcare extremely unfulfilling with constant pressures to meet unattainable Patient volumes and constant cutting of staff and services with expectation.”, “making it difficult to care”

FTC-2024-0022-1432: Neutral
> “What really got us to where we are today is government regulation, Medicare reimbursement to physicians, and health insurance company mergers and concentration and their corresponding clout to increase requirements for payment and decreasing insurer reimbursement to physicians”

FTC-2024-0022-1703: Positive
> “private equity can increase access to medical care by increasing efficiency; to deem all private equity as “bad for medicine” would be too broad”
Description
Based on whether the commenter role is directly affected by PE, I
categorized roles into three groups: "patient",
"provider", and "other". For example,
pharmacists and providers are grouped together under the provider
category, while organization and attorney roles are grouped under
other. The “patient” and “provider” groups are directly affected by PE
through their involvement in the healthcare delivery chain. In contrast,
the “other” group is indirectly affected by PE via institutional, legal,
or organizational influences.
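The grouping can be written as a lookup table. In this sketch, only the pharmacist, provider, organization, and attorney assignments are stated in the text; the remaining entries are assumptions inferred from the role list and group counts reported in this document, and the function name is hypothetical.

```python
ROLE_TO_CATEGORY = {
    "patient": "patient",
    # Stated in the text: pharmacists and providers belong to "provider".
    "pharmacist": "provider",
    "provider": "provider",
    # Assumed "provider" (clinical roles listed alongside the two above).
    "physician": "provider",
    "nurse": "provider",
    "psychotherapist": "provider",
    # Stated in the text: organization and attorney belong to "other".
    "organization": "other",
    "attorney": "other",
    # Assumed "other": not stated explicitly, inferred from the group counts.
    "caregiver": "other",
    "civil_servant": "other",
}

def role_category(role):
    """Map a Commenter_Role value to its general group; None if missing or unknown."""
    if role is None or role.strip() in ("", "NA"):
        return None
    return ROLE_TO_CATEGORY.get(role.strip().lower())

print(role_category("Pharmacist"), role_category("attorney"), role_category("NA"))
```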
Role and corresponding specialties
The table “Commenter Role with Specialties” summarizes
all role names along with their corresponding specialty names. In the
specialties column, missing or unavailable information is represented by
a “–”. For example, the role nurse in our data corresponds to
specialties including both “NA” and “register,” where “NA” indicates
that the specialty information could not be directly extracted or
inferred from the text.
Distribution of General Role
The "Distribution of General Role" pie chart shows the
proportion of the three groups: patient, provider, and other. The null
category represents missing or unavailable role information.
Distribution of Commenter_Role within the most common general role groups
The "Distribution of Commenter Roles in General Role: xxxx" pie
charts display the distribution of Commenter_Role within the most
common general role group.
Distribution of Specialty Within the Most Common Commenter Role in the Most Common General Role Category
The "Top xxxx Commenter Roles: Distribution of Specialty" bar
charts illustrate how specialties are distributed within each of the
top n Commenter_Role values in the most common general role category.

Commenter Role with Specialties and Total Count

| Commenter_Role | Count | Specialties |
|---|---|---|
| attorney | 2 | - |
| caregiver | 3 | - |
| civil_servant | 2 | - |
| nurse | 6 | register |
| organization | 4 | labor_union, ngo, coalition |
| patient | 20 | military veteran, kidney transplant recipient, CFO |
| pharmacist | 3 | retired |
| physician | 12 | emergency, anesthesiologist, terminated, rheumatology |
| provider | 6 | emergency |
| psychotherapist | 3 | emergency, terminated |

[1] “physician” “nurse” “provider” “pharmacist”
[5] “psychotherapist”
The following comments lack clear identifying information, making it
difficult to classify the commenter roles. Most of them are therefore
labeled as NA in Commenter_Role.
FTC-2024-0022-1916: Cannot determine the
commenter role; role inference is difficult
Comment:
> As someone who lived in Europe for 5 years, I can say for certain
that the US has the best health care and the most highly trained doctors
in the world. I moved back to the US specifically because I needed
better healthcare than I could find anywhere in Europe. Why would we
want to degrade that so that a tiny minority of people can reap huge
profits?? I am glad to see the government recognize the problems
presented by consolidation in health care industries, particularly
emergency rooms. Private equity firms often prioritize profits over
patient care and worker safety, leading to higher costs, reduced quality
of service, and increased risks for both patients and healthcare
workers. I urge you to consider implementing policies that would limit
or restrict the ability of private equity firms to acquire and operate
emergency rooms. This could include stricter regulations on mergers and
acquisitions in the healthcare industry, as well as greater transparency
and accountability requirements for these firms. Thank you.
FTC-2024-0022-0245: Cannot determine the
commenter role at all; text too short
Comment:
> Private equity has no business to be involved in the Medical
Professions.
FTC-2024-0022-1320: Commenter role unclear, but
“MD” appears in the name
Comment:
> It is unthinkable that the insurance company could own the hospital
or practice, thus, employing the doctors that they already stranglehold
with insurance denials. This is the definition of a nefarious
monopoly!
FTC-2024-0022-2091: Commenter role uncertain;
possibly nurse (mentions nursing)
Comment:
> Private Equity firms, large insurance companies have made working
in healthcare extremely unfulfilling with constant pressures to meet
unattainable patient volumes and constant cutting of staff and services
with expectation. Current employees will continue to do more. Nursing is
no longer about taking care of patients, but rather being able to check
enough boxes on a computer screen. Large insurance companies that have
constant denials on medication that are FDA approved for their condition
make it difficult to care for patients in a manner that follows
standards of care.
A chi-square test indicates that Commenter_Role is
statistically significantly associated with the Tone of the
comment (\(p<0.05\)).
The Contingency Table Heatmap shows counts for each combination
of Role ("patient", "provider",
"other", "no information") and Tone category
("negative", "very negative",
"neutral", "positive").
The histogram showing the distribution of tone within each role category reveals the following:
For commenters with no role information, the negative and very negative tones account for the largest shares. Combined with the observation that those who directly experience PE (patients and providers) tend to show negative or very negative tones, this suggests that many commenters without role information may actually belong to the patient or provider groups, if they had to be classified as patient, provider, or other.
For providers and patients, negative and very negative tones account for more comments than neutral or positive tones. Moreover, compared with patients, providers have a larger share of negative than very negative comments. This may be due to providers’ professional training and occupational discipline, which help them regulate emotional expression when describing issues.
For Other, the positive tone has the smallest share, while negative, very negative, and neutral tones show no marked imbalance. This suggests that people who do not directly experience PE also tend not to hold supportive attitudes toward PE, which may indicate that PE’s impact is broad and that its perceived societal effect is primarily negative.
word_count records the length of the comment text. For comments recorded in attached files, I counted the words manually. For comments recorded in the original data cell, I counted the words with the str_count() function in R.
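A whitespace-based word count can be sketched as follows. This is an assumption about the counting rule: the pattern passed to str_count() is not stated here, so counts produced this way may differ slightly from those in the tables.

```python
import re

def count_words(text):
    """Count maximal runs of non-whitespace characters."""
    return len(re.findall(r"\S+", text))

comment = "Private equity has no business to be involved in the Medical Professions."
print(count_words(comment))
```

Under this rule, punctuation attached to a word (such as the final period) stays part of that word, and a standalone symbol counts as one word.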

Overall Summary Word Count

| count | missing | min | q1 | median | mean | q3 | max | sd |
|---|---|---|---|---|---|---|---|---|
| 100 | 0 | 1 | 46.25 | 125.5 | 333.41 | 226.25 | 6297 | 806.8054 |
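The columns of these summary tables can be reproduced with the standard library. The sketch assumes the quartiles use linear interpolation between order statistics (R's default, which corresponds to `method="inclusive"` in Python); the function name `summarize` and the input values are illustrative.

```python
import statistics

def summarize(values):
    """Return count, min, quartiles, mean, max, and sample sd for a list of numbers."""
    if len(values) < 2:
        # A single observation has no spread; report it for every quantile.
        q1 = median = q3 = values[0]
        sd = None
    else:
        q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        sd = statistics.stdev(values)
    return {
        "count": len(values), "min": min(values), "q1": q1, "median": median,
        "mean": statistics.mean(values), "q3": q3, "max": max(values), "sd": sd,
    }

# Toy word counts, not taken from the data.
print(summarize([1, 2, 3, 4, 5, 6, 7, 8]))
print(summarize([3]))
```

The single-value case mirrors a group with one comment, where the standard deviation is undefined.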

Summary Word Count by Tone

| Tone | count | min | q1 | median | mean | q3 | max | sd |
|---|---|---|---|---|---|---|---|---|
| negative | 44 | 1 | 77.25 | 142 | 387.7727 | 271.75 | 6297 | 1002.44962 |
| neutral | 8 | 3 | 3.00 | 12 | 37.8750 | 21.25 | 227 | 76.89592 |
| positive | 1 | 3 | 3.00 | 3 | 3.0000 | 3.00 | 3 | NA |
| very_negative | 47 | 3 | 68.00 | 130 | 339.8511 | 173.50 | 3724 | 663.49463 |

Kruskal-Wallis Test for Word Count by Tone

| Statistic | Df | P_value |
|---|---|---|
| 13.693 | 3 | 0.003354 |
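For reference, the Kruskal-Wallis statistic is computed from the ranks of the pooled observations. The sketch below uses the textbook formula with the usual tie correction on toy groups; it is not the code that produced the table, and it assumes every group is non-empty and that the values are not all identical.

```python
from collections import defaultdict

def kruskal_wallis(groups):
    """groups: dict of name -> list of numbers. Returns (H, df) with tie correction."""
    pooled = sorted((v, name) for name, values in groups.items() for v in values)
    n = len(pooled)
    rank_sums = defaultdict(float)
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the ranks i+1 .. j shared by tied values
        t = j - i
        tie_term += t ** 3 - t
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        i = j
    h = 12.0 / (n * (n + 1)) * sum(
        rank_sums[name] ** 2 / len(values) for name, values in groups.items()
    ) - 3 * (n + 1)
    correction = 1.0 - tie_term / (n ** 3 - n)
    return h / correction, len(groups) - 1

# Toy groups, not taken from the data.
h, df = kruskal_wallis({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
print(h, df)
```

The statistic is then compared against a chi-square distribution with `df` degrees of freedom to obtain the p-value.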

Dunn's Test for Word Count by Tone

| Comparison | Z | P.unadj | P.adj |
|---|---|---|---|
| negative - neutral | 3.3397435 | 0.000838558 | 0.005031 |
| negative - positive | 1.7401757 | 0.081828165 | 0.490969 |
| neutral - positive | 0.4489645 | 0.653457281 | 1.000000 |
| negative - very_negative | 0.7190723 | 0.472096352 | 1.000000 |
| neutral - very_negative | -2.9618691 | 0.003057778 | 0.018347 |
| positive - very_negative | -1.5921502 | 0.111350954 | 0.668106 |

Summary Word Count by Role_category

| Role_category | count | min | q1 | median | mean | q3 | max | sd |
|---|---|---|---|---|---|---|---|---|
| other | 11 | 1 | 3.0 | 6.0 | 260.4545 | 206.00 | 1741 | 521.6973 |
| patient | 20 | 3 | 138.5 | 178.5 | 258.7000 | 264.25 | 1199 | 262.9307 |
| provider | 30 | 3 | 85.5 | 135.0 | 305.1000 | 283.50 | 2720 | 507.3077 |
| NA | 39 | 3 | 31.5 | 78.0 | 414.0769 | 132.50 | 6297 | 1175.3598 |

Kruskal-Wallis Test for Word Count by Role_category

| Statistic | Df | P_value |
|---|---|---|
| 4.455 | 2 | 0.1078 |

Dunn's Test for Word Count by Role_category

| Comparison | Z | P.unadj | P.adj |
|---|---|---|---|
| other - patient | -2.1099855 | 0.0348596 | 0.1046 |
| other - provider | -1.4864119 | 0.1371702 | 0.4115 |
| patient - provider | 0.9287684 | 0.3530091 | 1.0000 |
Concern_Level is based on four
binary indicators: Patient_cost,
Patient_quality, Provider_pay, and
Provider_quality. I calculated the row-wise sum of these
four variables for each comment, resulting in a score ranging from 0 to
4, labeled as "No Concern" (0), "Low" (1),
"Medium" (2), "High" (3), and
"Highest" (4), and interpreted as a measure of “concern
breadth”. To assess whether comment length differs significantly across
these concern levels, I first computed group-wise descriptive
statistics of word_count (“Summary of Word Count by Concern Level”
table), followed by a Kruskal–Wallis test to detect any overall
differences. Since the Kruskal–Wallis test is significant (\(p<0.05\)), I applied Dunn’s post hoc
pairwise comparisons with Bonferroni correction to identify which
specific levels differ significantly in comment length. Comments with
Medium and High levels of concern are significantly longer
than those with No Concern (adjusted p < 0.05 for both
comparisons). The Low vs. No Concern comparison is significant only
before adjustment (unadjusted p = 0.015, adjusted p = 0.151).

Summary of Word Count by Concern Level

| Concern_Level | count | min | q1 | median | mean | q3 | max | sd |
|---|---|---|---|---|---|---|---|---|
| No Concern | 9 | 3 | 5.00 | 13.0 | 21.11111 | 25.00 | 61 | 20.43554 |
| Low | 16 | 19 | 35.75 | 109.0 | 196.00000 | 144.75 | 1741 | 415.31129 |
| Medium | 37 | 3 | 66.00 | 129.0 | 514.37838 | 271.00 | 6297 | 1220.97753 |
| High | 30 | 66 | 119.75 | 155.5 | 315.33333 | 270.25 | 1805 | 415.47350 |
| Highest | 8 | 1 | 3.00 | 4.5 | 190.37500 | 297.25 | 756 | 295.83873 |
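The Concern_Level score described above is a row-wise sum mapped to a label. A minimal sketch, assuming the four indicators are coded as 0/1 with no missing values (how missing indicators were handled is not stated here):

```python
LABELS = ["No Concern", "Low", "Medium", "High", "Highest"]
INDICATORS = ["Patient_cost", "Patient_quality", "Provider_pay", "Provider_quality"]

def concern_level(row):
    """Sum the four binary indicators and map the 0-4 score to its label."""
    score = sum(int(row[name]) for name in INDICATORS)
    return LABELS[score]

# Toy row, not taken from the data.
row = {"Patient_cost": 1, "Patient_quality": 0, "Provider_pay": 1, "Provider_quality": 0}
print(concern_level(row))
```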

Kruskal-Wallis Test for Word Count by Concern_Level

| Statistic | Df | P_value |
|---|---|---|
| 23.376 | 4 | 0.0001065 |

Dunn's Test for Word Count by Concern_Level

| Comparison | Z | P.unadj | P.adj |
|---|---|---|---|
| High - Highest | 2.4754416 | 0.0133071529 | 0.1330715 |
| High - Low | 2.1517089 | 0.0314202915 | 0.3142029 |
| Highest - Low | -0.7364734 | 0.4614426671 | 1.0000000 |
| High - Medium | 1.2797534 | 0.2006318712 | 1.0000000 |
| Highest - Medium | -1.7198786 | 0.0854545159 | 0.8545452 |
| Low - Medium | -1.1753905 | 0.2398385037 | 1.0000000 |
| High - No Concern | 4.4166745 | 0.0000100231 | 0.0001002 |
| Highest - No Concern | 1.4273936 | 0.1534664769 | 1.0000000 |
| Low - No Concern | 2.4299783 | 0.0150997288 | 0.1509973 |
| Medium - No Concern | 3.6704134 | 0.0002421585 | 0.0024216 |
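The relationship between the Z, P.unadj, and P.adj columns can be checked directly. The sketch assumes that P.unadj is the two-sided normal p-value of Z and that the Bonferroni adjustment multiplies it by the number of comparisons, capped at 1, which agrees with the table (for example, 0.0133071529 × 10 = 0.1330715). The Z values are copied from the table; the function names are illustrative.

```python
import math

def two_sided_p(z):
    """Two-sided p-value of a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Z statistics from the Dunn's test table for Concern_Level.
z_scores = {
    "High - Highest": 2.4754416,
    "High - Low": 2.1517089,
    "Highest - Low": -0.7364734,
    "High - Medium": 1.2797534,
    "Highest - Medium": -1.7198786,
    "Low - Medium": -1.1753905,
    "High - No Concern": 4.4166745,
    "Highest - No Concern": 1.4273936,
    "Low - No Concern": 2.4299783,
    "Medium - No Concern": 3.6704134,
}
adjusted = dict(zip(z_scores, bonferroni([two_sided_p(z) for z in z_scores.values()])))
significant = [pair for pair, p in adjusted.items() if p < 0.05]
print(significant)
```

With the table's Z values, only the High vs. No Concern and Medium vs. No Concern comparisons fall below 0.05 after adjustment.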
Description
We are interested in the relationship between state and concern level.
First, I show the distribution of State. Then I construct a
contingency table with a heat map that counts the number of entries in
each combination of state and concern level. Based on that table, I run
a chi-square test to see whether the association between state and
concern level is significant. Lastly, we look at the distribution of
states within each concern level and within each concern category.
Distribution of State
The four states that made the most comments are NY, IL, NC and
FL.
Significance of correlation
The p-value is \(0.1912\), so the sample
does not show a statistically significant association between State and
concern level.
Distribution of State in each concern level
Distribution of State in each concern category
Patient_cost: FL and NC contribute most, each with 13.3%.
Patient_quality: FL, IL, NC, and NY contribute most, each with 8.3%.
Provider_pay: CT, IL, MD, NY, MI, OH, and UT contribute most, each with 14.3%.
Provider_quality: NY, IL, and FL contribute most (12.8%, 7.7%, and 7.7%, respectively).
Pearson’s Chi-squared test
data: contingency_matrix
X-squared = 120.64, df = 108, p-value = 0.1912
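The reported p-value can be checked from the statistic and degrees of freedom alone. The sketch below implements a Pearson chi-square test of independence with the standard library, using a series expansion of the regularized incomplete gamma function for the upper tail. The function names and the toy table are illustrative, no continuity correction is applied, and every expected count is assumed to be positive.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Series for the lower regularized incomplete gamma function P(a, z).
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def chi2_independence(table):
    """Pearson chi-square test on a list-of-rows contingency table: (statistic, df, p)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, chi2_sf(stat, df)

# Toy 2x3 table, not taken from the data.
stat, df, p = chi2_independence([[10, 20, 30], [20, 20, 20]])
print(stat, df, p)
# Tail probability for the statistic reported above.
print(chi2_sf(120.64, 108))
```

When a table has many cells relative to the number of observations, expected counts become small and the chi-square approximation can be unreliable, so such p-values are best read with caution.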
#write.csv(df,"100_entries_sample.csv")
#sample_50=read.csv("100_entries_sample.csv")
#sample_50=sample_50[1:50,]
#sample_50
#write.csv(sample_50,"50_out_100_sample.csv")